{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dd006f66",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_parsers/topic_parser.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d617ade9-796f-431f-86ff-6b865e0eb007",
   "metadata": {},
   "source": [
    "# TopicNodeParser\n",
    "\n",
    "[MedGraphRAG](https://arxiv.org/html/2408.04187) aims to improve the capabilities of LLMs in the medical domain by generating evidence-based results through a novel graph-based Retrieval-Augmented Generation framework, improving safety and reliability in handling private medical data.\n",
    "\n",
    "`TopicNodeParser` implements an approximate version of the chunking technique described in the paper.\n",
    "\n",
    "Here is the technique as outlined in the paper:\n",
    "\n",
    "```\n",
    "Large medical documents often contain multiple themes or diverse content. To process these effectively, we first segment the document into data chunks that conform to the context limitations of Large Language Models (LLMs). Traditional methods such as chunking based on token size or fixed characters typically fail to detect subtle shifts in topics accurately. Consequently, these chunks may not fully capture the intended context, leading to a loss in the richness of meaning.\n",
    "\n",
    "To enhance accuracy, we adopt a mixed method of character separation coupled with topic-based segmentation. Specifically, we utilize static characters (line break symbols) to isolate individual paragraphs within the document. Following this, we apply a derived form of the text for semantic chunking. Our approach includes the use of proposition transfer, which extracts standalone statements from a raw text Chen et al. (2023). Through proposition transfer, each paragraph is transformed into self-sustaining statements. We then conduct a sequential analysis of the document to assess each proposition, deciding whether it should merge with an existing chunk or initiate a new one. This decision is made via a zero-shot approach by an LLM. To reduce noise generated by sequential processing, we implement a sliding window technique, managing five paragraphs at a time. We continuously adjust the window by removing the first paragraph and adding the next, maintaining focus on topic consistency. We set a hard threshold that the longest chunk cannot excess the context length limitation of LLM. After chunking the document, we construct graph on each individual of the data chunk.\n",
    "```\n"
   ]
  },
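  {
   "cell_type": "markdown",
   "id": "5e8a1c2d",
   "metadata": {},
   "source": [
    "The procedure quoted above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `TopicNodeParser` implementation: `split_paragraphs`, `chunk_by_topic`, and the `same_topic` callback are hypothetical names, the callback stands in for the zero-shot LLM decision, proposition extraction is left out, chunk size is measured in characters, and the sliding window is simplified to the last `window_size` items of the current chunk.\n",
    "\n",
    "```python\n",
    "from typing import Callable, List\n",
    "\n",
    "\n",
    "def split_paragraphs(text: str) -> List[str]:\n",
    "    # Line breaks are the static separators: a blank line ends a paragraph.\n",
    "    return [p.strip() for p in text.split(\"\\n\\n\") if p.strip()]\n",
    "\n",
    "\n",
    "def chunk_by_topic(\n",
    "    propositions: List[str],\n",
    "    same_topic: Callable[[List[str], str], bool],  # hypothetical stand-in for the LLM call\n",
    "    max_chunk_size: int = 1000,  # assumed to be a character budget in this sketch\n",
    "    window_size: int = 5,\n",
    ") -> List[str]:\n",
    "    chunks: List[List[str]] = []\n",
    "    for prop in propositions:\n",
    "        if chunks:\n",
    "            current = chunks[-1]\n",
    "            # Only the most recent items of the chunk are shown to the judge.\n",
    "            window = current[-window_size:]\n",
    "            fits = len(\" \".join(current + [prop])) <= max_chunk_size\n",
    "            if fits and same_topic(window, prop):\n",
    "                current.append(prop)  # same topic: extend the current chunk\n",
    "                continue\n",
    "        chunks.append([prop])  # topic shift, size limit, or first item: new chunk\n",
    "    return [\" \".join(chunk) for chunk in chunks]\n",
    "```\n",
    "\n",
    "Passing the merge decision in as a callback keeps the sketch independent of any particular LLM or embedding model."
   ]
  },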
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d1c5118",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index llama-index-node-parser-topic"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12dcc784-f2c6-4c37-8771-57a921ff2eab",
   "metadata": {},
   "source": [
    "## Setup Data\n",
    "\n",
    "Here we consider a sample text.\n",
    "\n",
    "Note: The parser calls an LLM to generate propositions, which can lead to longer processing times when creating nodes. Exercise caution while experimenting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fdcd874",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"In this paper, we introduce a novel graph RAG method for applying LLMs to the medical domain, which we refer to as Medical Graph RAG (MedRAG). This technique improves LLM performance in the medical domain by response queries with grounded source citations and clear interpretations of medical terminology, boosting the transparency and interpretability of the results. This approach involves a three-tier hierarchical graph construction method. Initially, we use documents provided by users as our top-level source to extract entities. These entities are then linked to a second level consisting of more basic entities previously abstracted from credible medical books and papers. Subsequently, these entities are connected to a third level—the fundamental medical dictionary graph—that provides detailed explanations of each medical term and their semantic relationships. We then construct a comprehensive graph at the highest level by linking entities based on their content and hierarchical connections. This method ensures that the knowledge can be traced back to its sources and the results are factually accurate.\n",
    "\n",
    "To respond to user queries, we implement a U-retrieve strategy that combines top-down retrieval with bottom-up response generation. The process begins by structuring the query using predefined medical tags and indexing them through the graphs in a top-down manner. The system then generates responses based on these queries, pulling from meta-graphs—nodes retrieved along with their TopK related nodes and relationships—and summarizing the information into a detailed response. This technique maintains a balance between global context awareness and the contextual limitations inherent in LLMs.\n",
    "\n",
    "Our medical graph RAG provides Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. The results provides the provenance, or source grounding information, as it generates each response, and demonstrates that an answer is grounded in the dataset. Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material. It is super useful in the field of medicine that security is very important, and each of the reasoning should be evidence-based. By using such a method, we construct an evidence-based Medical LLM that the clinician could easiely check the source of the reasoning and calibrate the model response to ensure the safty usage of llm in the clinical senarios.\n",
    "\n",
    "To evaluate our medical graph RAG, we implemented the method on several popular open and closed-source LLMs, including ChatGPT OpenAI (2023a) and LLaMA Touvron et al. (2023), testing them across mainstream medical Q&A benchmarks such as PubMedQA Jin et al. (2019), MedMCQA Pal et al. (2022), and USMLE Kung et al. (2023). For the RAG process, we supplied a comprehensive medical dictionary as the foundational knowledge layer, the UMLS medical knowledge graph Lindberg et al. (1993) as the foundamental layer detailing semantic relationships, and a curated MedC-K dataset Wu et al. (2023) —comprising the latest medical papers and books—as the intermediate level of data to simulate user-provided private data. Our experiments demonstrate that our model significantly enhances the performance of general-purpose LLMs on medical questions. Remarkably, it even surpasses many fine-tuned or specially trained LLMs on medical corpora, solely using the RAG approach without additional training.\n",
    "\"\"\"\n",
    "\n",
    "from llama_index.core import Document\n",
    "\n",
    "documents = [Document(text=text)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "717bd52c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In this paper, we introduce a novel graph RAG method for applying LLMs to the medical domain, which we refer to as Medical Graph RAG (MedRAG). This technique improves LLM performance in the medical domain by response queries with grounded source citations and clear interpretations of medical terminology, boosting the transparency and interpretability of the results. This approach involves a three-tier hierarchical graph construction method. Initially, we use documents provided by users as our top-level source to extract entities. These entities are then linked to a second level consisting of more basic entities previously abstracted from credible medical books and papers. Subsequently, these entities are connected to a third level—the fundamental medical dictionary graph—that provides detailed explanations of each medical term and their semantic relationships. We then construct a comprehensive graph at the highest level by linking entities based on their content and hierarchical connections. This method ensures that the knowledge can be traced back to its sources and the results are factually accurate.\n",
      "\n",
      "To respond to user queries, we implement a U-retrieve strategy that combines top-down retrieval with bottom-up response generation. The process begins by structuring the query using predefined medical tags and indexing them through the graphs in a top-down manner. The system then generates responses based on these queries, pulling from meta-graphs—nodes retrieved along with their TopK related nodes and relationships—and summarizing the information into a detailed response. This technique maintains a balance between global context awareness and the contextual limitations inherent in LLMs.\n",
      "\n",
      "Our medical graph RAG provides Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. The results provides the provenance, or source grounding information, as it generates each response, and demonstrates that an answer is grounded in the dataset. Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material. It is super useful in the field of medicine that security is very important, and each of the reasoning should be evidence-based. By using such a method, we construct an evidence-based Medical LLM that the clinician could easiely check the source of the reasoning and calibrate the model response to ensure the safty usage of llm in the clinical senarios.\n",
      "\n",
      "To evaluate our medical graph RAG, we implemented the method on several popular open and closed-source LLMs, including ChatGPT OpenAI (2023a) and LLaMA Touvron et al. (2023), testing them across mainstream medical Q&A benchmarks such as PubMedQA Jin et al. (2019), MedMCQA Pal et al. (2022), and USMLE Kung et al. (2023). For the RAG process, we supplied a comprehensive medical dictionary as the foundational knowledge layer, the UMLS medical knowledge graph Lindberg et al. (1993) as the foundamental layer detailing semantic relationships, and a curated MedC-K dataset Wu et al. (2023) —comprising the latest medical papers and books—as the intermediate level of data to simulate user-provided private data. Our experiments demonstrate that our model significantly enhances the performance of general-purpose LLMs on medical questions. Remarkably, it even surpasses many fine-tuned or specially trained LLMs on medical corpora, solely using the RAG approach without additional training.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(documents[0].get_content())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b7ac8b7",
   "metadata": {},
   "source": [
    "## Setup LLM And Embedding Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "886c7682",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"  # Replace with your OpenAI API key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b082912",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "embed_model = OpenAIEmbedding()\n",
    "llm = OpenAI(model=\"gpt-4o-mini\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd21470a-a6b4-43aa-94d6-503860706404",
   "metadata": {},
   "source": [
    "## Define TopicNodeParser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b204c43c-f98a-47fb-b84c-1ed5e07c7f4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.node_parser.topic import TopicNodeParser"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b4959ff",
   "metadata": {},
   "source": [
    "### LLM-based topic similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf398967-74b0-4fe1-a6ee-2da246d33757",
   "metadata": {},
   "outputs": [],
   "source": [
    "node_parser = TopicNodeParser.from_defaults(\n",
    "    llm=llm,\n",
    "    max_chunk_size=1000,\n",
    "    similarity_method=\"llm\",  # can be \"llm\" or \"embedding\"\n",
    "    window_size=2,  # paper suggests window_size=5\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9cc498a-9d75-4a87-b8a8-fcc995872a4b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "efb00dbe0b894c97bd5d33b86c2d6d45",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d805e2be",
   "metadata": {},
   "source": [
    "#### Let's inspect the chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b4b2768",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This paper introduces a novel graph RAG method for applying LLMs to the medical domain. The novel graph RAG method is referred to as Medical Graph RAG (MedRAG). The Medical Graph RAG technique improves LLM performance in the medical domain. The Medical Graph RAG technique responds to queries with grounded source citations. The Medical Graph RAG technique provides clear interpretations of medical terminology. The Medical Graph RAG technique boosts the transparency of the results. The Medical Graph RAG technique boosts the interpretability of the results. The Medical Graph RAG approach involves a three-tier hierarchical graph construction method.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[0].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24a12494",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Documents provided by users are used as the top-level source to extract entities. The extracted entities are linked to a second level consisting of more basic entities. The more basic entities are previously abstracted from credible medical books and papers. The extracted entities are connected to a third level, which is the fundamental medical dictionary graph. The fundamental medical dictionary graph provides detailed explanations of each medical term. The fundamental medical dictionary graph provides the semantic relationships of medical terms.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[1].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbc3d733",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A comprehensive graph is constructed at the highest level by linking entities based on their content. A comprehensive graph is constructed at the highest level by linking entities based on their hierarchical connections.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[2].get_content())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "781ecdab",
   "metadata": {},
   "source": [
    "### Embedding-based topic similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5311623",
   "metadata": {},
   "outputs": [],
   "source": [
    "node_parser = TopicNodeParser.from_defaults(\n",
    "    embed_model=embed_model,\n",
    "    llm=llm,\n",
    "    max_chunk_size=1000,\n",
    "    similarity_method=\"embedding\",  # can be \"llm\" or \"embedding\"\n",
    "    similarity_threshold=0.8,\n",
    "    window_size=2,  # paper suggests window_size=5\n",
    ")"
   ]
  },
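  {
   "cell_type": "markdown",
   "id": "c4d7e9a1",
   "metadata": {},
   "source": [
    "With `similarity_method=\"embedding\"`, the merge decision depends on `similarity_threshold`. As a rough illustration of that kind of comparison, the sketch below thresholds the cosine similarity of two vectors. It assumes cosine similarity is the metric and is not the parser's actual code; `cosine_similarity`, `is_same_topic`, and the toy vectors are hypothetical.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from typing import Sequence\n",
    "\n",
    "\n",
    "def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:\n",
    "    # Dot product divided by the product of the vector lengths.\n",
    "    # Assumes both vectors are non-zero and have the same dimension.\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "\n",
    "def is_same_topic(\n",
    "    a: Sequence[float], b: Sequence[float], threshold: float = 0.8\n",
    ") -> bool:\n",
    "    # Hypothetical rule: merge only when the similarity reaches the threshold.\n",
    "    return cosine_similarity(a, b) >= threshold\n",
    "\n",
    "\n",
    "print(is_same_topic([1.0, 0.0], [0.9, 0.1]))  # nearly parallel toy vectors\n",
    "print(is_same_topic([1.0, 0.0], [0.0, 1.0]))  # orthogonal toy vectors\n",
    "```\n",
    "\n",
    "Under a rule like this, a higher threshold tends to produce more, smaller chunks, while a lower threshold merges more aggressively."
   ]
  },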
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b4016d1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2921c85be8984ed49fd63407d8d27497",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1f91de7",
   "metadata": {},
   "source": [
    "#### Let's inspect chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52c1bdeb-02ff-4d8b-af6a-2196a726b8b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This paper introduces a novel graph RAG method for applying LLMs to the medical domain. The novel graph RAG method is referred to as Medical Graph RAG (MedRAG). The Medical Graph RAG technique improves LLM performance in the medical domain. The Medical Graph RAG technique responds to queries with grounded source citations. The Medical Graph RAG technique provides clear interpretations of medical terminology. The Medical Graph RAG technique boosts the transparency of the results. The Medical Graph RAG technique boosts the interpretability of the results. The Medical Graph RAG approach involves a three-tier hierarchical graph construction method.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[0].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10492cef-9e5e-468e-a165-a666347cdfa0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Documents provided by users are used as the top-level source to extract entities. The extracted entities are linked to a second level consisting of more basic entities. The more basic entities are previously abstracted from credible medical books and papers. The extracted entities are connected to a third level, which is the fundamental medical dictionary graph. The fundamental medical dictionary graph provides detailed explanations of each medical term. The fundamental medical dictionary graph provides semantic relationships between medical terms. A comprehensive graph is constructed at the highest level by linking entities based on their content. A comprehensive graph is constructed at the highest level by linking entities based on their hierarchical connections. The Medical Graph RAG method ensures that the knowledge can be traced back to its sources. The Medical Graph RAG method ensures that the results are factually accurate.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[1].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57ef3df6-7bf8-49c2-8b85-46f0dc8ac51d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The U-retrieve strategy is implemented to respond to user queries. The U-retrieve strategy combines top-down retrieval with bottom-up response generation.\n"
     ]
    }
   ],
   "source": [
    "print(nodes[2].get_content())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llamaindex",
   "language": "python",
   "name": "llamaindex"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
